Description

The present invention concerns the coding of moving images in
which a human face is represented. It is concerned to achieve
low transmission rates by concentrating on movements associated
with speech. The invention also permits the synthesis of such
images to accompany real or synthetic speech.

It has already been proposed (see BELL LABORATORIES RECORD,
vol. 48, no. 4, April 1970, pages 110-115, Murray Hill, US;
F.W. MOUNTS: "Conditional replenishment: a promising technique
for video transmission") to reduce the required transmission
rate for a moving image by comparing successive frames of the
image and transmitting data only in respect of those parts of
the frame which have changed since the previous frame. The
present invention aims to take advantage of the knowledge that,
in transmitting an image of a face, the main information content
lies in movements of the mouth.

According to a first aspect of the invention there is provided
an apparatus for encoding a moving image including a human face
comprising:

means for receiving video input data;

means for output of data representing one frame of the image;

identification means arranged in operation for each frame of the
image to identify that part of the input data corresponding to
the mouth of the face represented and

(a) in a first phase of operation to compare the mouth data
parts of each frame with those of other frames to select a
representative set of mouth data parts, to store the
representative set and to output this set;

(b) in a second phase to compare the mouth data part of each
frame with those of the stored set and to generate a codeword to
be output indicating which member of the set the mouth data part
of that frame most closely resembles.

It will be appreciated that this procedure makes 
use of prior knowledge as to the nature of the 
image by identifying specifically the mouth of the 
face represented, and further takes advantage of 
the fact that the mouth can be adequately repre- 
sented by a selected representative set of mouth 
data parts.

According to a second aspect of the invention there is provided
a speech synthesiser including means for synthesis of a moving
image including a human face, comprising:

(a) means for storage and output of the image of a face;

(b) means for storage and output of a set of mouth data blocks
(Fig. 3) each corresponding to the mouth area of the face and
representing a respective different mouth shape;

(c) an input for receiving codes identifying words or parts of
words to be spoken;

(d) speech synthesis means responsive to the codes received at
the said input to synthesise words or parts of words
corresponding thereto;

(e) means storing a table relating such codes to codewords
identifying said mouth data blocks or sequences of such
codewords; and

(f) control means responsive to the codes received at the said
input to select the corresponding codeword or codeword sequence
from the table and to output it in synchronism with synthesis of
the corresponding word or part of a word by the speech synthesis
means.

According to a third aspect of the invention there is provided
an apparatus for synthesis of a moving image, comprising:

(a) means for storage and output of the image of a face;

(b) means for storage and output of a set of mouth data blocks
each corresponding to the mouth area of the face and
representing a respective different mouth shape;

(c) an audio input for receiving speech signals and frequency
analysis means responsive to such signals to produce sequences
of spectral parameters;

(d) means storing a table relating spectral parameter sequences
to codewords identifying mouth data blocks or sequences thereof;

(e) control means responsive to the said spectral parameters to
select for output the corresponding codewords or codeword
sequences from the table.

Some embodiments of the invention will now be described, by way
of example, with reference to the accompanying drawings, in
which:

Figure 1 is a block diagram of an image transmission system
including an encoder and receiver according to embodiments of
the invention;
Figure 2 illustrates an image to be transmitted;
Figure 3 illustrates a set of mouth shapes;
Figures 4, 5 and 6 illustrate masking windows used in face, eyes
and mouth identification;
Figure 7 is a histogram obtained using the mask of fig 6;
Figures 8 and 9 illustrate binary images of the mouth area of an
image;
Figures 10 and 11 are plan and elevational views of a head to
illustrate the effects of changes in orientation;
Figure 12 illustrates apparatus for speech analysis; and
Figure 13 is a block diagram of a receiver embodying the
invention.

Figure 1 illustrates an image transmission system with a
transmitter 1, transmission link 2 and receiver 3. The
techniques employed are equally applicable to recording and the
transmission link 2 could thus be replaced by a tape recorder or
other means such as a semiconductor store.

The transmitter 1 receives an input video signal from a source
such as a camera.

The moving image to be transmitted is the face 5 (fig 2) of a
speaker whose speech is also transmitted over the link 2 to the
receiver. During normal speech there is relatively little change
in most of the area of the face - i.e. other than the mouth area
indicated by the box 6 in fig 2. Therefore only one image of the
face is transmitted. Moreover, it is found that changes in the
mouth positions during speech can be realistically represented
using a relatively small number of different mouth positions
selected as typical. Thus a code-book of mouth positions is
generated, and, once this has been transmitted to the receiver,
the only further information that needs to be sent is a sequence
of codewords identifying the successive mouth positions to be
displayed.

The system described is a knowledge based system - i.e. the
receiver, following a "learning" phase, is assumed to "know" the
speaker's face and the set of mouth positions. The operation of
the receiver is straightforward and involves, in the learning
phase, entry of the face image into a frame store (from which an
output video signal is generated by repetitive readout) and
entry of the set of mouth positions into a further "mouth"
store, and, in the transmission phase, using each received
codeword to retrieve the appropriate mouth image data and
overwrite the corresponding area of the image store.
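
By way of illustration only, the receiver's transmission-phase
operation might be sketched as follows. The function and variable
names are hypothetical, and the sketch assumes that the frame
store and the mouth store are held as simple lists of rows of pel
values and that the box position is already known.

```python
def overwrite_mouth(frame_store, mouth_store, codeword, top, left):
    """Copy the stored mouth image selected by `codeword` into the frame store.

    frame_store: list of rows of pel values (the stored face image).
    mouth_store: list of mouth images, indexed by codeword (assumed layout).
    top, left:   position of the mouth box within the frame store.
    """
    mouth = mouth_store[codeword]
    for y, row in enumerate(mouth):
        # Replace only the pels lying inside the mouth box on this line.
        frame_store[top + y][left:left + len(row)] = row
    return frame_store


# Usage: a 4 x 4 face image and a code-book of two 1 x 2 mouth images.
face = [[0] * 4 for _ in range(4)]
mouths = [[[1, 1]], [[2, 2]]]
overwrite_mouth(face, mouths, codeword=1, top=2, left=1)
```

Repetitive readout of the frame store then yields the output
video signal with the selected mouth in place.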

Transmitter operation is necessarily more complex and here the
learning phase requires a training sequence from the speaker, as
follows:

1) The first frame is stored and transmitted, suitably encoded
(e.g. using conventional redundancy reduction techniques), to
the receiver.

2) The stored image is analysed in order to (a) identify the
head of the speaker (so that the head in future frames may be
tracked despite head movements), and (b) identify the mouth -
i.e. define the box 6 shown in figure 2. The box co-ordinates
(and dimensions, if not fixed) are transmitted to the receiver.

3) Successive frames of the training sequence are analysed to
track the mouth and thus define the current position of the
box 6, and to compare the contents of the box (the "mouth
image") with the first and any previously selected images in
order to build up a set of selected mouth images. This set of
images (illustrated in fig 3) is stored at the transmitter and
transmitted to the receiver.



The transmission phase then requires:

4) Successive frames are analysed (as in (3) above) to identify
the position of the box 6;

5) The content of the box in the current frame is compared with
the stored mouth images to identify that one of the set which is
nearest to it; the corresponding codeword is then transmitted.

Assuming a frame rate of 25/second and a "codebook" of 24 mouth
shapes (i.e. a 5-bit code), the required data rate during the
transmission phase would be 125 bits/second.
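
The arithmetic behind this figure can be checked with a short
sketch. The function names are illustrative only; the sketch
assumes one fixed-length codeword per frame and no further
overhead on the link.

```python
def codeword_bits(codebook_size):
    """Smallest whole number of bits able to index `codebook_size` entries."""
    # (n - 1).bit_length() is the bit count needed for indices 0 .. n-1.
    return max(1, (codebook_size - 1).bit_length())


def transmission_rate(frames_per_second, codebook_size):
    """Bits per second when one codeword is sent for every frame (assumption)."""
    return frames_per_second * codeword_bits(codebook_size)


bits = codeword_bits(24)           # 24 mouth shapes need a 5-bit code
rate = transmission_rate(25, 24)   # 25 frames/second x 5 bits = 125 bits/second
```

A 5-bit code could in fact index up to 32 mouth shapes at the
same data rate.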

The receiver display obtained using the basic system described
is found to be generally satisfactory, but is somewhat unnatural
principally because (a) the head appears fixed and (b) the eyes
remain unchanged (specifically, the speaker appears never to
blink). The first of these problems may be alleviated by
introducing random head movement at the receiver, or by tracking
the head position at the transmitter and transmitting
appropriate co-ordinates to the receiver. The eyes could be
transmitted using the same principles as applied to the mouth,
though here the size of the "codebook" might be much less.
Similar remarks apply to the chin, and facial lines.

The implementation of the transmitter steps enumerated above
will now be considered in more detail, assuming a monochrome
source image of 128 x 128 pel resolution, of a head and
shoulders picture. The first problem is that of recognition of
the facial features and pinpointing them on the face. Other
problems are determining the orientation of the head and the
changing shape of the mouth as well as the movement of the eyes.
The method proposed by Nagao (M Nagao - "Picture Recognition and
Data Structure", Graphic Languages - ed Nake and Rosenfeld,
1972) is suggested.

Nagao's method involves producing a binary representation of the
image with an edge detector. This binary image is then analysed
by moving a window down it and summing the edge pixels in each
column of the window. The output from the window is a set of
numbers in which the large numbers represent strong vertical
edges. From this such features as the top and sides of the head,
followed by the eyes, nose and mouth can be initially
recognised.

The algorithm goes on to determine the outline of the jaw and
then works back up the face to fix the positions of nose, eyes
and sides of face more accurately. A feedback process built into
the algorithm allows for repetition of parts of the search if an
error is detected. In this way the success rate is greatly
improved.

A program has been written using Nagao's algorithm which draws
fixed size rectangles around the features identified as eyes and
mouth. Details of this program are as follows:

A Laplacian operator is applied together with a threshold to
give a binary image of the same resolution. Edge pixels become
black, others white. A window of dimension 128 pels x 8 lines is
positioned at the top of the binary image. The black pels in
each column are summed and the result is stored as an entry in a
128 x 32 element array (ARRAY 1). The window is moved down the
image by 4 lines each time and the process repeated. The window
is repositioned 32 times in all and the 128 x 32 element array
is filled (Fig 4).
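
The construction of ARRAY 1 might be sketched as follows. The
function name is hypothetical. The sketch also assumes that the
last window position is simply clipped at the bottom edge of the
image, which is one way of obtaining 32 positions from a 128-line
image; the description does not say how the final position is
handled.

```python
def column_edge_sums(binary, window_height=8, step=4):
    """Build ARRAY 1 from a binary edge image.

    binary: list of rows, each a list of 0/1 values (1 = black edge pel).
    Returns one row of per-column sums for every window position.
    """
    height = len(binary)
    width = len(binary[0])
    array1 = []
    for top in range(0, height, step):
        # Assumption: a window overhanging the bottom edge is clipped.
        rows = binary[top:top + window_height]
        array1.append([sum(row[x] for row in rows) for x in range(width)])
    return array1


# Usage: a 128 x 128 image containing one vertical edge at column 10.
image = [[1 if x == 10 else 0 for x in range(128)] for _ in range(128)]
array1 = column_edge_sums(image)   # 32 rows of 128 column sums
```

A strong vertical edge thus appears as a column of high values
running down ARRAY 1.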

A search is coriducted through the rows of 
ARRAY 1 starting from the top of the image in 
order to locate the sides of the head. As these are
strong vertical edges they will be identified by high
values in ARRAY 1.

The first edge located from the left side of the 
image is recorded and similarly for the right side. 
The distance between these points is measured 
(head width) and if this distance exceeds a criterion 
a search is made for activity between the two 
points which may indicate the eyes. 

The eyes are found using a one-dimensional 
mask, as illustrated in fig 5 which has two slots 
corresponding to the eyes separated by a gap for 
the nose. The width of the slots and their separa- 
tion is selected to be proportional to the measured 
head width. The mask is moved along a row within 
the head area. The numbers in ARRAY 1 falling 
within the eye slots are summed and from this 
result, the numbers in the nose slot are subtracted.
The final result is a sensitive indicator of activity 
due to the eyes. 

The maximum value along a row is recorded 
along with the position of the mask when this 
maximum is found. The mask is then moved down 
to the next row and the process repeated. 

Out of the set of maximum values the overall 
maximum is found. The position of this maximum 
is considered to give the vertical position of the 
eyes. Using the horizontal position of the mask 
when this maximum was found we can estimate the 
midpoint of the face. 
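
A minimal sketch of the eye search is given below. The names are
hypothetical, and the slot and gap widths (0.25 and 0.2 of the
head width) are assumed values chosen only for illustration; the
description states merely that they are proportional to the
measured head width.

```python
def eye_score(row, left, slot, gap):
    """Sum over the two eye slots minus the sum over the nose gap."""
    left_eye = sum(row[left:left + slot])
    nose = sum(row[left + slot:left + slot + gap])
    right_eye = sum(row[left + slot + gap:left + 2 * slot + gap])
    return left_eye + right_eye - nose


def locate_eyes(array1, head_left, head_right, slot_frac=0.25, gap_frac=0.2):
    """Return (row of ARRAY 1 holding the eyes, estimated face midpoint).

    head_left, head_right: columns bounding the head (right exclusive).
    slot_frac, gap_frac:   assumed proportions of the head width.
    """
    head_width = head_right - head_left
    slot = max(1, round(head_width * slot_frac))
    gap = max(1, round(head_width * gap_frac))
    mask_width = 2 * slot + gap
    best = None  # (score, row index, mask left edge)
    for r, row in enumerate(array1):
        # Slide the mask along the row, staying within the head area.
        for left in range(head_left, head_right - mask_width + 1):
            score = eye_score(row, left, slot, gap)
            if best is None or score > best[0]:
                best = (score, r, left)
    _, eye_row, left = best
    return eye_row, left + mask_width // 2
```

Subtracting the nose slot penalises positions where edge activity
is continuous across the mask, so the score peaks only where two
separated regions of activity line up with the slots.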

Next a fifteen pixel wide window (fig 6) is
applied to the binary image. It extends from a
position just below the eyes to the bottom of the
image and is centred on the middle of the face.

The black pels in each row of the window are 
summed and the values are entered into a one- 
dimensional array (ARRAY 2). If this array is dis- 
played as a histogram, such features as the bottom 
of the nose, the mouth and the shadow under the 
lower lip show up clearly as peaks (Figure 7). The 
distribution of these peaks is used to fix the posi- 
tion of the mouth. 
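
The formation of ARRAY 2 and the picking out of its peaks might
be sketched as follows. The names are hypothetical, and the
simple local-maximum test is an assumption; the description does
not state how the distribution of peaks is interpreted in order
to fix the mouth position.

```python
def row_edge_sums(binary, centre_x, top, width=15):
    """Build ARRAY 2: black-pel count per row inside a narrow vertical window.

    The window is `width` pels wide, centred on column `centre_x`, and runs
    from row `top` (just below the eyes) to the bottom of the image.
    """
    half = width // 2
    lo = max(0, centre_x - half)
    hi = min(len(binary[0]), centre_x + half + 1)
    return [sum(row[lo:hi]) for row in binary[top:]]


def peaks(values, minimum=1):
    """Indices of local maxima of at least `minimum` (assumed peak test)."""
    found = []
    for i, v in enumerate(values):
        before = values[i - 1] if i > 0 else -1
        after = values[i + 1] if i + 1 < len(values) else -1
        if v >= minimum and v > before and v >= after:
            found.append(i)
    return found
```

Each index returned by `peaks` is an offset from `top`, so the
image row of a peak is `top` plus that index.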

The box position is determined centred on the centre of the face
as defined above, and on the centre of the mouth (row 35
[illegible] given resolution), on a suitable box [illegible]
wide by 24 high.

The next stage is to ensure that the identification of the mouth
(box position) in the first frame and during the learning (and
transmission) phase is consistent - i.e. that the mouth is
always centred within the box. Application of Nagao's algorithm
to each frame of a sequence in turn is found to show a
considerable error in registration of the mouth box from frame
to frame.

A solution to this problem was found by applying the algorithm
to the first frame only and then tracking the mouth frame by
frame. This is achieved by using the mouth in the first frame of
the binary sequence as a template and auto-correlating with each
of the successive frames in the binary image referred to above.
The search is started in the same relative position in the next
frame and the mask moved by 1 pixel at a time until a local
maximum is found.
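
A minimal sketch of such a tracking step is given below. The
names are hypothetical. The correlation measure (a count of
coincident black pels) and the four-neighbour search pattern are
assumptions; the description specifies only that the mask is
moved one pixel at a time until a local maximum is reached.

```python
def match_score(frame, template, top, left):
    """Correlation of two binary images: count of coincident black pels."""
    return sum(
        t & frame[top + y][left + x]
        for y, t_row in enumerate(template)
        for x, t in enumerate(t_row)
    )


def track(frame, template, top, left):
    """Hill-climb from (top, left) in one-pel steps to a local maximum."""
    h, w = len(template), len(template[0])
    best = match_score(frame, template, top, left)
    while True:
        moved = False
        for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0)):
            ny, nx = top + dy, left + dx
            # Keep the mask wholly inside the frame.
            if 0 <= ny <= len(frame) - h and 0 <= nx <= len(frame[0]) - w:
                score = match_score(frame, template, ny, nx)
                if score > best:
                    best, top, left, moved = score, ny, nx, True
        if not moved:
            return top, left
```

The search terminates because each accepted move strictly
increases the score, which is bounded by the number of black pels
in the template.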

The method was used to obtain a sequence using the correct mouth
but copying the rest of the face from the first frame. This
processed sequence was run and showed some registration jitter,
but this error was only about one pixel, which is the best that
can be achieved without sub-pixel interpolation.

Typical binary images of the mouth area (mouth open and mouth
closed) are shown in figures 8 and 9.

Only a small set of mouths from the total possible in the whole
sequence can be stored in the look-up table, for obvious
reasons. This requires the shape of a mouth to be recognised and
whether it is similar to a shape which has occurred previously
or not. New mouth positions would then be stored in the table.

The similarity or difference of a mouth to previously occurring
mouths thus needs to be based on a quantisation process in order
to restrict the number of entries in the table.

The method by which this is achieved is as follows, all
processing being carried out on grey scale mouth images rather
than the binary version referred to above.

The mouth image from the first frame is stored as the first -
initially the only - entry in a look-up table. The mouth image
from each frame in the training sequence is then processed by
(a) comparing it with each entry in the table by subtracting the
individual pel values and summing the absolute values of those
differences over the mouth box area; (b) comparing the sum with
a threshold value and, if the threshold is exceeded, entering
that mouth image as a new entry in the table.
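
The learning procedure might be sketched as follows. The names
are hypothetical, and the sketch reads step (b) as meaning that
a mouth image is added only when its sum exceeds the threshold
for every existing entry, i.e. when even the closest entry is too
far away; that reading is an assumption.

```python
def sad(a, b):
    """Sum of absolute pel differences between two equal-sized images."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def build_codebook(mouth_images, threshold):
    """Select a set of mouth images from a training sequence.

    mouth_images: grey scale mouth images (lists of rows), one per frame.
    threshold:    sum above which a mouth counts as a new shape.
    """
    codebook = [mouth_images[0]]  # the first frame is the first entry
    for image in mouth_images[1:]:
        # Assumed reading: add only if no existing entry is close enough.
        if min(sad(image, entry) for entry in codebook) > threshold:
            codebook.append(image)
    return codebook
```

Raising the threshold yields fewer, more widely separated
entries; lowering it yields more.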

However, this particular method of finding the sum of the
absolute differences is very susceptible to movement. For
example, two identical images where the second one has been
shifted by just one pixel to the left would produce a large
value for the sum, whereas these two images should be seen as
identical. If a small degree of movement within the overall
tracking is permitted to try to compensate for the fact that the
sum rises sharply if the image is displaced by only one pixel
then a reduction in the size of the look-up table can be
achieved without a corresponding loss of mouth shapes. This can
be done if, at each frame, the mouth in the current frame is
compared three times with each of the code-book entries - at the
current position, shifted to the left by one pixel, and shifted
to the right by one pixel, and the minimum sum found in each
case. The result generating the smallest minimum sum together
with the value of the shift in the x-direction is recorded. This
movement could, of course, be performed in both the x- and the
y-directions, but it has been found that the majority of
movement is in the x-direction.
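
The three-position comparison might be sketched as follows. The
names are hypothetical; the sketch assumes the whole frame is
available so that the box can be displaced, and it skips any
displacement that would carry the box outside the frame.

```python
def box_sad(frame, entry, top, left):
    """Sum of absolute differences between a code-book entry and the
    frame contents under a box whose corner is at (top, left)."""
    return sum(
        abs(frame[top + y][left + x] - v)
        for y, row in enumerate(entry)
        for x, v in enumerate(row)
    )


def best_match(frame, codebook, top, left, shifts=(-1, 0, 1)):
    """Return (smallest sum, codeword, x-shift) over all entries and shifts."""
    best = None
    for codeword, entry in enumerate(codebook):
        for dx in shifts:
            x0 = left + dx
            if x0 < 0 or x0 + len(entry[0]) > len(frame[0]):
                continue  # displaced box would leave the frame
            total = box_sad(frame, entry, top, x0)
            if best is None or total < best[0]:
                best = (total, codeword, dx)
    return best
```

Extending the search to the y-direction would only require a
second loop over vertical shifts.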

If the desired table size is exceeded, or the number of entries
acquired during the training sequence is substantially lower
than the table size, then the threshold level is adjusted
appropriately and the learning phase repeated; to avoid
excessive delay such conditions might be predicted from the
acquisition rate.

Once the table has been constructed, the transmission phase can
commence, in which each successive mouth image is compared - as
described in (a) above - with all those of the stored table and
a codeword identifying the entry which gave the lowest summation
result is then transmitted.

The computation required to do this is large but can be
decreased if an alternative searching method is adopted. The
simplest alternative would be, instead of looking at all the
mouths in the look-up table and finding the minimum sum, to use
the first one that has a sum which is less than the threshold.
On its own, this would certainly be quicker, but would be likely
to suffer from a large amount of jerkiness if the order in which
the table is scanned were fixed. Therefore the order in which
the table is scanned needs to be varied. A preferred variation
requires a record of the order in which mouths from the
code-book appear - a sort of rank-ordering - to be kept. For
example, if the previous frame used mouth 0 from the table, then
one scans the table for the current frame starting with the
entry which has appeared most often after mouth 0 in the past,
say mouth 5. If the sum of the absolute differences between the
current frame and mouth 5 is less than the threshold then
mouth 5 is chosen to represent the current frame; if it is
greater than the threshold, one moves along to the next mouth in
the code-book which has appeared after mouth 0 the second most
often, and so on. When a mouth is finally chosen, the record of
which mouth is chosen is updated to include the current
information.
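
A minimal sketch of the rank-ordered search follows. The class
and method names are hypothetical. Keeping the record as a table
of follow-on counts, and falling back to the minimum-sum entry
when no entry is below the threshold, are both assumptions not
stated in the description.

```python
def sad(a, b):
    """Sum of absolute pel differences between two equal-sized images."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


class RankedCodebook:
    def __init__(self, entries):
        self.entries = list(entries)
        n = len(self.entries)
        # follows[i][j]: how often mouth j has been chosen just after mouth i.
        self.follows = [[0] * n for _ in range(n)]

    def scan_order(self, previous):
        """Codewords ordered by how often each has followed `previous`."""
        counts = self.follows[previous]
        return sorted(range(len(self.entries)), key=lambda j: -counts[j])

    def choose(self, image, previous, threshold):
        """Pick the first entry under the threshold, scanning in rank order."""
        best_code, best_sum = None, None
        for code in self.scan_order(previous):
            total = sad(image, self.entries[code])
            if total < threshold:
                best_code = code
                break  # good enough: stop scanning early
            if best_sum is None or total < best_sum:
                best_code, best_sum = code, total  # assumed fallback
        self.follows[previous][best_code] += 1  # update the record
        return best_code
```

Because likely successors are tried first, most frames are
matched after only a few comparisons.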

Optionally, mouth images having a lowest summation result above
a set value might be recognised as being shapes not present in
the set and initiate a dynamic update process in which an
additional mouth image is appended to the table and sent to the
receiver during the transmission phase. In most circumstances
transmission of the "new" mouth would not be fast enough to
permit its use for the frame giving rise to it, but it would be
available for future occurrences of that shape.

Care must be taken in this case that the set value is not too
low because this can result in new mouths being placed into the
look-up table all the way through the sequence. This is no more
than image sub-sampling which would obviously produce a
reasonable result but which would need a code-book whose size is
proportional to the length of the sequence being processed.

The set value can be arrived at through trial and error. It
would obviously be desirable if this threshold could be selected
automatically, or dispensed with altogether. The sum of the
absolute differences between frames is always a positive
measure, and the look-up table therefore represents a metric
space. Each mouth in the look-up table can be thought of as
existing in a multi-dimensional metric space, and each frame in
a sequence lies in a cluster around one of these codebook
mouths. Various algorithms such as the Linde-Buzo-Gray exist
which could be used to find the optimum set of mouths. These
algorithms use the set of frames in the sequence as a training
set and involve lengthy searches to minimise the error and find
the optimum set. Preferable to this is to find a
"representative" set of mouths which are sub-optimal, but which
can be found more quickly than the optimum set. In order to do
this it is necessary to specify the number of mouths that are to
be used, and then to select the required number of mouths from
the training sequence. The look-up table can still be updated
during the transmission phase using the same algorithm as for
training, but the total number of mouths in the table will
remain constant.

The selection of mouths follows a basic rule - if the minimum
distance (distance can be used since it is a metric space)
between the current frame and one of the mouths in the table is
greater than the minimum distance between that mouth in the
table and any other mouth in the table then the current mouth
should be included in the table. If it is less, then that mouth
is simply represented by the nearest mouth in the table. When a
new mouth is to be included in the table during a transmission
phase then the mouth that has to be removed is selected
according to the following rule - find the pair of mouths in the
look-up table that are closest together and throw one of them
away, preferably the one that is nearest to the new mouth.

When a new mouth is entered in the table, then clearly it has no
past history with which to order the other mouths in the
code-book - each will never have appeared after this new mouth.
When the next frame in the sequence is encountered, the look-up
table would be scanned in order, arriving at the new entry last.
However, this new entry is the most likely choice, since mouths
tend to appear in clumps, particularly just after a new mouth
has been created. So the ordering is adjusted so that the new
mouth is scanned first.
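
The selection and replacement rules might be sketched as
follows. The names are hypothetical, and placing the new mouth in
the slot vacated by the discarded one (so that the remaining
codewords keep their meaning) is an assumption; the table is
assumed to hold at least two entries.

```python
def sad(a, b):
    """Sum of absolute pel differences, used here as the distance."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def update_codebook(codebook, image):
    """Apply the selection rule; return the codeword representing `image`.

    The table size stays constant: a new mouth replaces one of the two
    closest existing mouths (the one nearer to the new mouth).
    """
    n = len(codebook)
    d_new = [sad(image, entry) for entry in codebook]
    nearest = min(range(n), key=d_new.__getitem__)
    # Distance from the nearest entry to its own closest neighbour.
    nearest_gap = min(
        sad(codebook[nearest], entry)
        for j, entry in enumerate(codebook) if j != nearest
    )
    if d_new[nearest] <= nearest_gap:
        return nearest  # represented by the nearest existing mouth
    # Find the closest pair in the table and discard one of them.
    _, i, j = min(
        (sad(codebook[i], codebook[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    drop = i if d_new[i] <= d_new[j] else j
    codebook[drop] = image  # assumed: reuse the vacated slot
    return drop
```

The caller would then move the returned codeword to the head of
the scan order when it denotes a newly entered mouth.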

The above-described transmission system 
might be employed in a picture-phone system em- 
ploying a standard telephone link; to allow for the 
learning phase, the image would not immediately 
appear at the receiver. Following the initial delay - 
perhaps 15 seconds assuming non-digital transmis- 
sion of the face - the moving picture would be 
transmitted and displayed in real time. 

A fixed mouth overlay can be used on a face orientated
differently from the forward facing position if the difference
is not too large. Also, it is clear that in order to show
general movements of the head such as nodding and shaking one
must display the face as seen from a number of different angles.
A displayed head is unconvincing unless there is some general
movement, if only random movement.

In a system such as the one described, dif- 
ferent views of the face would have to be transmit- 
ted and stored at the receiver. If a complete set of 
data were sent for every different face position this 
would require excessive channel and storage 
capacities. A possible way around the problem is 
shown in Fig 10. 

The appearance of the face in the frontal position is
represented by the projection (x1-x5) in plane P. If the head is
turned slightly to one side its appearance to the observer will
now be represented by (x1'-x5') in plane P'. If the illumination
of the face is fairly isotropic then a two dimensional
transformation of (x1-x5) should be a close approximation to
(x1'-x5').

The important differences would occur at the sides of the head
where new areas are revealed or occluded and, similarly, at the
nose. Thus by transmitting a code giving the change in
orientation of the head as well as a small set of differences,
the whole head could be reconstructed. The differences for each
head position could be stored and used in the future if the same
position is identified.

The concept of producing pseudo-rotations by 2-D transformation
is illustrated with reference to the "face" picture of
Figure 11.

To simulate the effect of vertical axis rotation in a direction
such that the nose moves by a displacement S from left to right
(as viewed):

(1) Points to the left of (X1-X1') do not move.
(2) Points on the line (X2-X2') move to the right with
displacement S/2. (Region (X1,X1',X2,X2') is stretched
accordingly).
(3) Points on the line (X3-X3') move to the right with
displacement S. (Region (X2,X2',X3,X3') is stretched).
(4) Points on the line (X4-X4') move to the right by
displacement S. (Region (X3,X3',X4,X4') is translated to right).
(5) Points on the line (X5-X5') move to the right with
displacement S/2. (Region (X4,X4',X5,X5') is shrunk).
(6) Points to the right of the line (X6-X6') do not move.
(Region (X5,X5',X6,X6') is shrunk).
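
The six rules above amount to a horizontal displacement that
varies with position across the face. A minimal sketch follows;
the function name is hypothetical, the six lines are assumed to
be vertical so that each is described by a single x co-ordinate,
and linear interpolation within each region is an assumption
consistent with the regions being stretched, translated or
shrunk.

```python
def displacement(x, lines, s):
    """Rightward displacement of a point at horizontal position `x`.

    lines: ascending x co-ordinates of the six lines X1 .. X6.
    s:     displacement of the nose (lines X3 and X4).
    """
    # Displacements prescribed on the six lines by rules (1) to (6).
    shifts = (0.0, s / 2, s, s, s / 2, 0.0)
    if x <= lines[0] or x >= lines[-1]:
        return 0.0  # outside X1 .. X6 nothing moves
    for k in range(5):
        x_a, x_b = lines[k], lines[k + 1]
        if x_a <= x <= x_b:
            # Assumed: displacement varies linearly across each region.
            t = (x - x_a) / (x_b - x_a)
            return shifts[k] + t * (shifts[k + 1] - shifts[k])


# Usage: lines at x = 10, 20, ..., 60 and a nose displacement of 4 pels.
lines = (10, 20, 30, 40, 50, 60)
midway = displacement(25, lines, 4)   # halfway between X2 and X3
```

Regions in which the displacement increases from left to right
are stretched, the region of constant displacement is translated,
and regions in which it decreases are shrunk.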

Two-dimensional graphical transformations could be used in a
system for a standard videoconferencing application. In this
system, human subjects would be recognised and isolated from
non-moving foreground and background objects. Foreground and
background would be stored in memory at different hierarchical
levels according to whether they were capable of occluding
moving objects. Relatively unchanging moving bodies such as
torsos would be stored on another level as would more rapidly
changing parts such as the arms and head.

The principle of operation of the system would require the
transmission end to identify movement of the various segmented
parts and send motion vectors accordingly. These would be used
by the receiver to form a prediction for each part in the next
frame. The differences between the prediction and the true
picture would be sent as in a standard motion compensation
system.

The system should achieve high data compression without
significant picture degradation for a number of reasons:

1) If an object is occluded and then revealed once more the data
does not have to be retransmitted.

2) For relatively unchanging bodies such as torsos a very good
prediction could be formed using minor graphical transformations
such as translations and rotations in the image plane and
changes of scale. The differences between the prediction and the
true picture should be small.

3) For the more rapidly moving objects a good prediction should
still be possible although the differences would be greater.

4) It could treat subjectively important features in the scene
differently from the less important features. For instance,
faces could be weighted more heavily than rapidly moving arms.

A second embodiment of the invention relates to the synthesis of
a moving picture of a speaker to accompany synthesised speech.
Two types of speech synthesis will be considered:

(a) Limited vocabulary synthesis in which digitised
representations of complete words are stored and the words are
retrieved under control of manual, computer or other input and
regenerated. The manner of storage, whether PCM or as formant
parameters for example, does not affect the picture synthesis.

(b) Allophone synthesis in which any word can be synthesised by
supplying codes representing sounds to be uttered; these codes
may be generated directly from input text (text to speech
systems).

In either case there are two stages to the face 
synthesis; a learning phase corresponding to that 
described above, and a synthesis phase in which
the appropriate mouth codewords are generated to 
accompany the synthesised speech. 

Considering option (a) first, the speech vocabu- 
lary will usually be generated by recording the 
utterances of a native speaker and it will often be 
convenient to use the face of the same speaker. If 
another face is desired, or to add a vision facility to 
an existing system, the substitute speaker can 
speak along with a replay of the speech vocabu- 
lary. Either way the procedure is the same. The 
learning phase is the same as that described 
above, in that the system acquires the required 
face frame and mouth look-up table. However it 
must also record the sequence of mouth position 
codewords corresponding to each word and store 
this sequence in a further table (the mouth code 
table). It is observed that this procedure does not 
need to be carried out in real time and hence offers 
the opportunity of optimising the mouth sequences 
for each word. 

In the synthesis phase input codes supplied to 
the synthesiser are used not only to retrieve the 
speech data and pass it to a speech regeneration 
unit or synthesiser but also to retrieve the mouth 
codewords and transmit these in synchronism with 
the speech to a receiver which reconstructs the 
moving pictures as described above with reference 
to figure 1. Alternatively the receiver functions 
could be carried out locally, for local display or for
onward transmission of a standard video signal. 

In the case of (b) allophone synthesis, a real face is again
required and the previously described learning phase is carried
out to generate the face image and mouth image table. Here
however it is necessary to correlate mouth positions with
individual phonemes (i.e. parts of words) and thus the owner of
the face needs to utter, simultaneously with its generation by
the speech synthesiser, a representative passage including at
least one example of each allophone which the speech synthesiser
is capable of producing. The codewords generated are then
entered into a mouth look-up table in which each entry
corresponds to one allophone. Most entries will consist of more
than one codeword. In some cases the mouth positions
corresponding to a given phoneme may vary in dependence on the
preceding or following phonemes and this may also be taken into
account. Retrieval of the speech and video data takes place in
similar manner to that described above for the "whole word"
synthesis.

Note that in the "synthetic speech" embodiment the face frame,
mouth image table and mouth position code words may, as in the
transmission system described above, be transmitted to a remote
receiver for regeneration of a moving picture, but in some
circumstances, e.g. a visual display to accompany a synthetic
speech computer output, the display may be local and hence the
"receiver" processing may be carried out on the same apparatus
as the table and codeword generation. Alternatively, the
synthesised picture may be generated locally and a conventional
video signal transmitted to a remote monitor.

The question of synchronisation will now be 
considered further. 

A typical text-to-speech synthesis comprises the steps of:

(a) Conversion of plain text input to phonetic representation.

(b) Conversion of phonetic to lower phonetic representation.

(c) Conversion of lower phonetic to formant parameters; a typical
parameter update period would be 10ms.

This amount of processing involves a degree of delay; moreover,
some conversion stages have an inherent delay since the conversion
is context dependent (e.g. where the sound of a particular
character is influenced by those which follow it). Hence the
synthesis process involves queueing, and timing needs to be
carefully considered to ensure that the synthesised lip movements
are synchronised with the speech.

Where (as mooted above) the visual synthesis uses the allophone
representation as input data from the speech synthesiser, and if
the speech synthesis process from that level downward involves
predictable delays, then proper timing may be ensured simply by
introducing corresponding delays in the visual synthesis.
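
A minimal sketch of such fixed-delay compensation is given below,
assuming that the delay of the speech path is known in advance and
can be rounded up to a whole number of video frames. The class
name `DelayLine`, the helper `frames_for_delay` and the idle
codeword 0 are hypothetical; the 40ms frame period is that used
elsewhere in this description.

```python
from collections import deque


def frames_for_delay(delay_ms, frame_ms=40):
    """Number of whole video frames covering `delay_ms` (rounded up)."""
    return -(-delay_ms // frame_ms)


class DelayLine:
    """Delays mouth codewords by a fixed number of video frames."""

    def __init__(self, delay_frames, idle=0):
        # Pre-fill with an idle codeword (assumed: a resting mouth shape).
        self._buffer = deque([idle] * delay_frames)

    def push(self, codeword):
        """Accept the newest codeword, return the one that is now due."""
        self._buffer.append(codeword)
        return self._buffer.popleft()
```

One codeword is pushed per video frame; the codeword returned is
the one to be displayed in that frame.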

An alternative proposal is to insert flags in the speech
representations. This could permit the option of programming mouth
positions into the source text instead of (or in addition to)
using a lookup table to generate the mouth positions from the
allophones. Either way, flags indicating the precise instants at
which mouth positions change could be maintained in the speech
representations down to (say) the lower phonetic level. The speech
synthesiser creates a queue of lower phonetic codes which are then
converted to formant parameters and passed to the formant
synthesiser hardware; as the codes are "pulled off" the queue,
each flag, once the text preceding it has been spoken, is passed
to the visual synthesiser to synchronise the corresponding mouth
position change.
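
The flag mechanism may be sketched as follows. This is an
illustrative outline only: the types `Phonetic` and `MouthFlag`,
the function `run_queue` and the callbacks `speak` and
`show_mouth` are hypothetical names, and it is assumed that
`speak` returns only once the code passed to it has actually been
spoken.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Phonetic:
    code: str          # a lower phonetic code awaiting synthesis


@dataclass
class MouthFlag:
    mouth_code: int    # mouth position to adopt at this instant


def run_queue(items, speak, show_mouth):
    """Pull items off the queue in order.

    Because `speak` is assumed to block until the sound has been produced,
    each flag reaches the visual synthesiser only after the speech that
    precedes it in the queue has been spoken.
    """
    queue = deque(items)
    while queue:
        item = queue.popleft()
        if isinstance(item, MouthFlag):
            show_mouth(item.mouth_code)
        else:
            speak(item.code)
```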

A third embodiment of the invention concerns 
the generation of a moving face to accompany real 
speech input. 

Again, a surrogate speaker is needed to pro- 
vide the face and the learning phase for generation 
of the mouth image table takes place as before. 
The generation of the mouth code table depends 
on the means used to analyse the input speech; 
however, one option is to employ spectrum analy- 
sis to generate sequences of spectral parameters 
(a well known technique), with the code table serv- 
ing to correlate those parameters and mouth im- 
ages. 

Apparatus for such speech analysis is shown in Figure 12. Each
vowel phoneme has a distinct visual appearance. The visual
correlate of the auditory phoneme is called a viseme [K W Berger -
Speechreading: Principles and Methods. Baltimore: National
Educational Press, 1972, p73-107]. However, many of the consonants
have the same visual appearance and the most common classification
of consonant visemes has only 12 categories. This means that there
will be no visible error if the system confuses phonemes belonging
to the same category. Since there is less acoustic energy
generated in consonant formation than vowel formation it would be
more difficult for a speech recogniser to distinguish between
consonants. Thus the many to one mapping of consonant phonemes to
consonant visemes is fortuitous for this system.

A method of analysing speech would use a filter bank 10 with 14-15
channels covering the entire speech range. The acoustic energy in
each channel is integrated using a leaky integrator 11 and the
output sampled 12 at the video frame rate (every 40ms). A subject
is required to pronounce during a training sequence a full set of
phoneme sounds and the filter bank analyses the speech. Individual
speech sounds are identified by thresholding the energy over each
set of samples. The sample values are stored in a set of memory
locations 13 which are labelled with the appropriate phoneme name.
These form a set of templates which subsequently are used to
identify phonemes in an unknown speech signal from the same
subject. This is done by using the filter bank to analyse the
unknown speech at the same sampling rate. The unknown speech
sample is compared with each of the templates by summing the
squares of the differences of the corresponding components. The
best match is given by the smallest difference. Thus the device
outputs a code corresponding to the best phoneme match. There
would also be a special code to indicate silence.
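
The comparison described above amounts to a nearest-template
search under a sum-of-squared-differences measure, which might be
sketched as below. The function names, the first-order form chosen
for the leaky integrator, the `leak` coefficient, and the use of a
total-energy threshold to detect silence are assumptions made for
the purpose of illustration.

```python
SILENCE = "SILENCE"  # hypothetical name for the special silence code


def leaky_integrate(energies, leak=0.5):
    """First-order leaky integrator: y[n] = leak*y[n-1] + (1-leak)*x[n]."""
    state = 0.0
    output = []
    for energy in energies:
        state = leak * state + (1.0 - leak) * energy
        output.append(state)
    return output


def squared_distance(sample, template):
    """Sum of the squares of the differences of corresponding components."""
    return sum((s - t) ** 2 for s, t in zip(sample, template))


def classify(sample, templates, silence_threshold=0.0):
    """Return the phoneme name whose template best matches `sample`.

    `sample` holds one integrated energy value per filter-bank channel.
    Treating low total energy as silence is an assumption of this sketch.
    """
    if sum(sample) <= silence_threshold:
        return SILENCE
    return min(templates,
               key=lambda name: squared_distance(sample, templates[name]))
```

In use, `templates` would hold one vector of 14-15 channel values
per phoneme name, captured from the same subject during the
training sequence.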

While the subject uttered the set of phonemes during the training
sequence a moving sequence of pictures of the mouth area is
captured. By pinpointing the occurrence of each phoneme the
corresponding frame in the sequence is located and a subset of
these frames is used to construct a code book of mouths. In
operation a look-up table is used to find the appropriate mouth
code from the code produced by the speech analyser. The code
denoting silence should invoke a fully closed mouth position. A
synthetic sequence is created by overlaying the appropriate mouth
over the face at video rate.
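
The look-up and overlay steps might be sketched as follows, with
images represented simply as lists of rows of pixel values. The
function names and the parameters `top` and `left` (the position
of the mouth area within the face) are hypothetical.

```python
def overlay(face, mouth, top, left):
    """Return a copy of `face` with `mouth` written over the mouth area."""
    frame = [row[:] for row in face]          # leave the stored face intact
    for dy, mouth_row in enumerate(mouth):
        frame[top + dy][left:left + len(mouth_row)] = mouth_row
    return frame


def synthesise(face, mouth_images, mouth_lookup, phoneme_codes, top, left):
    """Yield one output frame per phoneme code from the speech analyser.

    `mouth_lookup` maps each analyser code (including the silence code)
    to a mouth code; `mouth_images` maps mouth codes to code-book images.
    """
    for code in phoneme_codes:
        mouth_code = mouth_lookup[code]
        yield overlay(face, mouth_images[mouth_code], top, left)
```

One code per video frame is assumed, so that the generator yields
frames at video rate when driven by the speech analyser output.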

As with the case of synthesised speech, the "receiver" processing
may be local or remote. In the latter case, it is proposed, as an
additional modification, that the mouth image table stored at the
transmitter might contain a larger number of entries than is
normally sent to the receiver. This would enable the table to
include mouth shapes which, in general, occur only rarely, but may
occur frequently in certain types of speech: for example, shapes
which correspond to sounds which occur only in certain regional
accents. Recognition of the spectral parameters corresponding to
such a sound would then initiate the dynamic update process
referred to earlier to make the relevant mouth shape(s) available
at the receiver.

The construction of appropriate display (receiver) arrangements
for the above proposals will now be further considered (see
Figure 13).

A frame store 100 is provided, into which during the learning
phase the received still frame is entered from an input decoder
101, whilst a "mouth" store 102 stores the desired number (say 25)
of mouth positions. Readout logic 103 repeatedly reads the
contents of the frame store and adds synchronising pulses to feed
a video monitor 104. In the transmission phase, received codewords
are supplied to a control unit 105 which controls overwriting of
the relevant area of the frame store 100 with the corresponding
mouth store entry. Clearly this overwriting needs to be rapid so
as not to be visible to the viewer. These effects could be reduced
by dividing the update area into small blocks and overwriting in a
random or predefined non-sequential manner. Alternatively, if the
frame store architecture includes windows or sprites, then these
could be preloaded with the picture updates and switched in and
out to create the appropriate movement. In some cases it may be
possible to simplify the process by employing x - y shifting of
the window/sprite.
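
The block-wise, non-sequential overwriting mentioned above might
be sketched as follows. The function names, the block size, and
the use of a seeded pseudo-random shuffle to obtain the block
order are assumptions of the sketch; a predefined order could
equally be substituted.

```python
import random


def block_order(rows, cols, block, seed=None):
    """Top-left corners of all blocks in the update area, shuffled."""
    corners = [(r, c)
               for r in range(0, rows, block)
               for c in range(0, cols, block)]
    random.Random(seed).shuffle(corners)
    return corners


def overwrite_mouth(frame_store, mouth, top, left, block=2, seed=None):
    """Overwrite the mouth area of `frame_store` in place, block by block.

    Blocks are written in a non-sequential order so that a partially
    completed update is less noticeable than a top-to-bottom sweep.
    """
    rows, cols = len(mouth), len(mouth[0])
    for r0, c0 in block_order(rows, cols, block, seed):
        for r in range(r0, min(r0 + block, rows)):
            for c in range(c0, min(c0 + block, cols)):
                frame_store[top + r][left + c] = mouth[r][c]
```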

Claims 

1. An apparatus for encoding a moving image including a human
face (5) comprising:

means (1) for receiving video input data;

means for output of data enabling one frame of the image to be
reproduced;

identification means arranged in operation for each frame of the
image to identify that part of the input data corresponding to the
mouth (6) of the face represented and

(a) in a first phase of operation to compare the mouth data parts
of each frame with those of other frames to select a
representative set (Fig. 3) of mouth data parts, to store the
representative set and to output this set;

(b) in a second phase to compare the mouth data part of each frame
with those of the stored set and to generate a codeword to be
output indicating which member of the set the mouth data part of
that frame most closely resembles.

2. An apparatus according to claim 1 in which the identification
means is arranged in operation firstly to identify that part of
one frame of input data corresponding to the mouth of the face
represented and to identify the mouth part of successive frames by
auto-correlation with data from the said one frame.

3. An apparatus according to claim 1 or 2 arranged in operation
during the first phase to store a first mouth data part and then
for the mouth data parts of each successive frame to compare it
with the first and any other stored mouth data part and if the
result of the comparison exceeds a threshold value, to store and
output it.

4. An apparatus according to claim 1, 2 or 3 in which the
comparison of mouth data is carried out by subtraction of
individual picture element values and summing the absolute values
of the differences.

5. An apparatus according to claim 1, 2, 3 or 4 including means
for obtaining the coordinates of the position of the face within
successive frames of the image and generating coded data
representing those coordinates.

6. An apparatus according to any one of the preceding claims, in
which during the second phase in the event that the result of the
comparison between a mouth data part and that one of the set which
it most closely resembles exceeds a predetermined threshold, that
data part is output and stored as part of the set.

7. An apparatus according to any one of the preceding claims
further including identification means arranged in operation for
each frame of the image to identify that part of the input data
corresponding to the eyes of the face represented and

(a) in the first phase of operation to compare the eye data parts
of each frame with those of other frames to select a
representative set of eye data parts, to store this representative
set and to output the said set;

(b) in the second phase to compare the eye data part of each frame
with those of the stored set and to generate a codeword indicating
which member of the set the eye data part of that frame most
closely resembles.

8. A speech synthesiser including means for synthesis of a moving
image including a human face, comprising:

(a) means for storage and output of the image of a face;

(b) means for storage and output of a set of mouth data blocks
(Fig. 3) each corresponding to the mouth area of the face and
representing a respective different mouth shape;

(c) an input for receiving codes identifying words or parts of
words to be spoken;

(d) speech synthesis means responsive to the codes received at the
said input to synthesise words or parts of words corresponding
thereto;

(e) means storing a table relating such codes to codewords
identifying said mouth data blocks or sequences of such codewords;
and

(f) control means responsive to the codes received at the said
input to select the corresponding codeword or codeword sequence
from the table and to output it in synchronism with synthesis of
the corresponding word or part of a word by the speech synthesis
means.

9. A synthesiser according to claim 8 in which the speech
synthesis means includes means arranged in operation for
processing and queueing the input codes, the queue including flag
codes indicating changes in mouth shape, and responsive to each
flag code to transmit to the control means, after the speech
synthesiser has generated the speech represented by the input code
preceding that flag code in the queue, an indication whereby the
control means may synchronise the codeword output to the
synthesised speech.

10. An apparatus for synthesis of a moving image, comprising:

(a) means for storage and output of the image of a face;

(b) means for storage and output of a set of mouth data blocks
each corresponding to the mouth area of the face and representing
a respective different mouth shape;

(c) an audio input for receiving speech signals and frequency
analysis means (10, 11, 12) responsive to such signals to produce
sequences of spectral parameters;

(d) means (13) storing a table relating spectral parameter
sequences to codewords identifying mouth data blocks or sequences
thereof;

(e) control means responsive to the said spectral parameters to
select for output the corresponding codewords or codeword
sequences from the table.

11. An apparatus according to claim 8, 9 or 10 further including
frame store means (100) for receiving and storing data
representing one frame of the image;

means (103) for repetitive readout of the frame store to produce a
video signal; and

control means (105) arranged in operation to receive the selected
codewords and in response to each codeword to read out the
corresponding mouth data block and to effect insertion of that
data into the data supplied to the readout means (103).

Revendications

1. Un appareil destiné à encoder une image mobile comprenant un
visage humain (5) comprenant:

un moyen (1) pour recevoir des données vidéo d'entrée;

un moyen pour sortir des données permettant à une trame de l'image
d'être reproduite;

un moyen d'identification agencé en fonctionnement pour chaque
trame de l'image pour identifier l'élément des données d'entrée
correspondant à la bouche (6) du visage représenté et

(a) pour comparer, dans une première phase de fonctionnement, les
éléments de données de la bouche de chaque trame avec ceux
d'autres trames pour choisir un jeu représentatif (Fig. 3)
d'éléments de données de bouche, pour mémoriser le jeu
représentatif et sortir ce jeu;

(b) pour comparer, dans une deuxième phase, l'élément de donnée de
bouche de chaque trame avec celles du jeu mémorisé et engendrer un
mot de code à sortir indiquant à quel élément du jeu l'élément de
données de bouche de cette trame ressemble le plus étroitement.

2. Un appareil selon la revendication 1, dans lequel le moyen
d'identification est agencé en fonctionnement pour identifier en
premier lieu l'élément d'une première trame de donnée d'entrée
correspondant à la bouche du visage représenté et pour identifier
l'élément de bouche de trames successives par une autocorrélation
avec les données provenant de ladite première trame.

3. Un appareil selon la revendication 1 ou 2 agencé en
fonctionnement pour mémoriser, pendant la première phase, un
premier élément de donnée de bouche et pour comparer ensuite
chacun des éléments de données de bouche de chaque trame
successive avec le premier et avec tout autre élément de donnée de
bouche mémorisé et pour le mémoriser et le sortir si le résultat
dépasse une valeur de seuil.

4. Un appareil selon la revendication 1, 2 ou 3, dans lequel la
comparaison de données de bouche est effectuée par soustraction de
valeurs individuelles d'éléments d'image et addition des valeurs
absolues des différences.

5. Un appareil selon la revendication 1, 2, 3 ou 4, comprenant un
moyen pour obtenir les coordonnées de la position du visage à
l'intérieur de trames successives de l'image et engendrer des
données codées représentant ces coordonnées.

6. Un appareil selon l'une quelconque des précédentes
revendications, dans lequel, pendant la deuxième phase, dans le
cas où le résultat de la comparaison entre un élément de donnée de
bouche et l'élément du jeu qui lui ressemble le plus étroitement
dépasse un seuil prédéterminé, cet élément de donnée est sorti et
mémorisé en tant qu'élément du jeu.

7. Un appareil selon l'une quelconque des précédentes
revendications comprenant en outre un moyen d'identification
agencé en fonctionnement pour chaque trame de l'image pour
identifier l'élément de la donnée d'entrée correspondant aux yeux
du visage représenté et

(a) pour comparer, dans une première phase de fonctionnement, les
éléments de donnée d'yeux de chaque trame avec ceux d'autres
trames pour choisir un jeu représentatif d'éléments de donnée
d'yeux, afin de mémoriser ce jeu représentatif et de sortir ledit
jeu;

(b) pour comparer, dans une deuxième phase, l'élément de donnée
d'yeux de chaque trame avec ceux du jeu mémorisé et pour engendrer
un mot de code indiquant à quel élément du jeu l'élément de donnée
d'yeux de cette trame ressemble le plus étroitement.

8. Un synthétiseur de parole incluant des moyens pour la synthèse
d'une image mobile comprenant un visage humain comprenant:

(a) un moyen de mémorisation et de sortie de l'image d'un visage;

(b) un moyen de mémorisation et de sortie d'un jeu de blocs de
donnée de bouche (Fig. 3) correspondant chacun à la zone de bouche
du visage et représentant une forme respective différente de
bouche;

(c) une entrée destinée à recevoir des codes identifiant des mots
ou des éléments de mot à dire;

(d) des moyens de synthèse de la parole sensibles aux codes reçus
à ladite entrée pour synthétiser des mots ou des éléments de mots
qui leur correspondent;

(e) des moyens mémorisant un tableau reliant ces codes à des mots
de code identifiant lesdits blocs de données de bouche ou des
séquences de ces dits mots de code; et

(f) un moyen de commande sensible aux codes reçus à ladite entrée
pour choisir le mot de code ou la séquence de mots de code
correspondant dans le tableau et pour le sortir en synchronisme
avec la synthèse du mot ou de l'élément de mot correspondant par
le moyen de synthèse de la parole.

9. Un synthétiseur selon la revendication 8 dans lequel le moyen
de synthèse de la parole comprend un moyen agencé en
fonctionnement pour traiter et mettre en file les codes d'entrée,
la file comprenant des codes de drapeau indiquant des variations
de la forme de bouche, et sensible à chaque code de drapeau pour
transmettre au moyen de commande, après que le synthétiseur de
parole a engendré la parole représentée par le code d'entrée
précédant ce code de drapeau dans la file, une indication grâce à
laquelle le moyen de commande peut synchroniser la sortie du mot
de code avec la parole synthétisée.

10. Un appareil destiné à synthétiser une image mobile comprenant:

(a) un moyen de mémorisation et de sortie de l'image d'un visage;

(b) un moyen de mémorisation et de sortie d'un jeu de blocs de
données de bouche correspondant chacun à la zone de bouche du
visage et représentant une forme respective différente de bouche;

(c) une entrée audio pour recevoir des signaux de parole et un
moyen d'analyse de fréquence (10, 11, 12) sensibles à de tels
signaux pour produire des séquences de paramètres spectraux;

(d) un moyen (13) mémorisant un tableau reliant des séquences de
paramètres spectraux à des mots de code identifiant des blocs de
données de bouche ou des séquences de ceux-ci;

(e) un moyen de commande sensible auxdits paramètres spectraux
pour choisir, afin de les sortir à partir du tableau, les mots de
code ou les séquences de mots de code correspondants.

11. Un appareil selon la revendication 8, 9 ou 10 comprenant en
outre un moyen (100) de mémorisation de trame pour recevoir et
mémoriser les données représentant une trame de l'image;

un moyen (103) pour lire de façon répétitive la mémoire de trame
pour produire un signal vidéo; et

un moyen de commande (105) agencé en fonctionnement pour recevoir
les mots de code choisis et pour lire, en réponse à chaque mot de
code, le bloc correspondant de données de bouche et pour effectuer
l'insertion de cette donnée dans la donnée fournie au moyen de
lecture (103).

Patentansprüche

1. Gerät zum Kodieren eines sich bewegenden Bildes einschließlich
eines menschlichen Gesichts (5), welches aufweist:

eine Einrichtung (1) zum Empfangen von Videoeingabedaten;

eine Einrichtung zur Datenausgabe, welche es gestattet, einen
Datenblock des Bildes wiederherzustellen;

eine im Betrieb für jeden Datenblock des Bildes angeordnete
Identifikationseinrichtung zum Identifizieren des Teiles der
Eingabedaten, welche dem Mund (6) des dargestellten Gesichts
entsprechen und

(a) um in einer ersten Betriebsphase die Munddatenteile jedes
Datenblocks mit denen anderer Datenblöcke zu vergleichen, um einen
repräsentativen Satz (Fig. 3) von Munddatenteilen auszuwählen, den
repräsentativen Satz zu speichern und diesen Satz auszugeben;

(b) um in einer zweiten Phase die Munddatenteile jedes Datenblocks
mit denen des gespeicherten Satzes zu vergleichen und zum Erzeugen
eines auszugebenden Codeworts, welches anzeigt, welchem Element
des Satzes die Munddatenteile dieses Datenblocks am meisten
ähneln.

2. Gerät nach Anspruch 1, in welchem die Identifikationseinrichtung
im Betrieb angeordnet ist, um als erstes denjenigen Teil eines
Datenblocks von Eingabedaten zu identifizieren, der dem Mund des
dargestellten Gesichts entspricht und zum Identifizieren des
Mundteils von nachfolgenden Datenblöcken durch Autokorrelation mit
Daten des einen Datenblocks.

3. Gerät nach Anspruch 1 oder 2, welches angeordnet ist, um im
Betrieb während der ersten Phase einen ersten Munddatenteil zu
speichern und dann für die Munddatenteile jedes nachfolgenden
Datenblocks ihn mit dem ersten und jedem anderen gespeicherten
Munddatenteil zu vergleichen, und, falls das Ergebnis des
Vergleiches einen Schwellenwert überschreitet, ihn zu speichern
und auszugeben.

4. Gerät nach Anspruch 1, 2 oder 3, in welchem der Vergleich von
Munddaten durch Subtraktion individueller Bildelementwerte und
Summieren der absoluten Werte der Differenzen durchgeführt wird.

5. Gerät nach Anspruch 1, 2, 3 oder 4 einschließlich einer
Einrichtung zum Erhalten der Koordinaten der Position des Gesichts
innerhalb nachfolgender Datenblöcke des Bildes und Erzeugen
kodierter Daten, welche diese Koordinaten darstellen.

6. Gerät nach einem der vorhergehenden Ansprüche, in welchem
während der zweiten Phase in dem Falle, daß das Ergebnis des
Vergleichs zwischen einem Munddatenteil und demjenigen des Satzes,
welchem es am meisten ähnelt, einen vorbestimmten Schwellenwert
überschreitet, dieser Datenteil ausgegeben und als ein Teil des
Satzes gespeichert wird.

7. Gerät nach einem der vorhergehenden Ansprüche, weiterhin mit
einer Identifikationseinrichtung, welche angeordnet ist, im
Betrieb für jeden Datenblock des Bildes denjenigen Teil der
Eingabedaten zu identifizieren, der den Augen des dargestellten
Gesichts entspricht, und

(a) in der ersten Betriebsphase die Augendatenteile jedes
Datenblocks mit denen anderer Datenblöcke zu vergleichen, um einen
repräsentativen Satz von Augendatenteilen auszuwählen, diesen
repräsentativen Satz zu speichern und den Satz auszugeben;

(b) in der zweiten Phase den Augendatenteil jedes Datenblocks mit
denen des gespeicherten Satzes zu vergleichen und ein Codewort zu
erzeugen, welches angibt, welchem Element des Satzes der
Augendatenteil dieses Datenblocks am meisten ähnelt.

8. Sprachsynthetisator, welcher eine Einrichtung zur Synthese
eines sich bewegenden Bildes beinhaltet einschließlich eines
menschlichen Gesichts, wobei der Synthetisator aufweist:

(a) eine Einrichtung zum Speichern und Ausgeben des Bildes eines
Gesichts;

(b) eine Einrichtung zur Speicherung und Ausgabe eines Satzes von
Munddatenblöcken (Fig. 3), deren jede dem Mundgebiet des Gesichts
entsprechen und eine jeweilige unterschiedliche Mundform
darstellen;

(c) eine Eingabe zum Empfangen von Codes, welche Worte oder Teile
von zu sprechenden Worten identifizieren;

(d) eine Sprachsyntheseeinrichtung, welche auf den an der Eingabe
empfangenen Code anspricht, um Worte oder dazu entsprechende Teile
von Worten zu synthetisieren;

(e) eine Einrichtung, die eine Tabelle speichert, welche derartige
Codes mit Codeworten in Beziehung setzt, welche die
Munddatenblöcke oder Sequenzen derartiger Codeworte identifiziert;
und

(f) eine Steuereinrichtung, welche auf die an der Eingabe
empfangenen Codes anspricht, um das entsprechende Codewort oder
die Codewortsequenz von der Tabelle auszuwählen und sie synchron
mit der Synthese des entsprechenden Wortes oder Teiles eines
Wortes von der Sprachsyntheseeinrichtung auszugeben.

9. Synthetisator nach Anspruch 8, in welchem die
Sprachsyntheseeinrichtung eine Einrichtung beinhaltet, die
angeordnet ist, um im Betrieb die Eingabecodes zu verarbeiten und
in Warteschlangen einzureihen, wobei die Warteschlange
Kennzeichencodes enthält, welche Änderungen in der Mundform
anzeigen, und in Antwort auf jeden Kennzeichencode zum Senden
einer Anzeige an die Steuereinrichtung, nachdem der
Sprachsynthetisator die Sprache erzeugt hat, welche durch den
Eingabecode dargestellt wird, der dem Kennzeichencode in der
Warteschlange vorausgeht, wobei die Steuereinrichtung das an die
synthetisierte Sprache ausgegebene Codewort synchronisieren kann.

10. Gerät zur Synthese eines sich bewegenden Bildes, wobei das
Gerät aufweist:

(a) eine Einrichtung zum Speichern und Ausgeben des Bildes eines
Gesichts;

(b) eine Einrichtung zum Speichern und Ausgeben eines Satzes von
Munddatenblöcken, die jeweils dem Mundgebiet des Gesichts
entsprechen und eine jeweilige unterschiedliche Mundform
darstellen;

(c) eine Audioeingabe zum Empfangen von Sprachsignalen und eine
Frequenzanalyseeinrichtung (10, 11, 12), welche auf derartige
Signale anspricht zum Erzeugen von Sequenzen spektraler Parameter;

(d) eine Einrichtung (13), die eine Tabelle speichert, welche
spektrale Parametersequenzen mit Codeworten in Beziehung setzt,
wobei Munddatenblöcke oder Sequenzen davon identifiziert werden;

(e) eine Steuereinrichtung, die auf die spektralen Parameter
anspricht, um für eine Ausgabe die entsprechenden Codeworte oder
Codewortsequenzen von der Tabelle auszuwählen.

11. Gerät nach Anspruch 8, 9 oder 10, weiterhin mit einer
Datenblockspeichereinrichtung (100) zum Empfangen und Speichern
von Daten, welche einen Datenblock des Bildes darstellen;

eine Einrichtung (103) zum repetitiven Auslesen des
Datenblockspeichers zum Erzeugen eines Videosignals; und

eine Steuereinrichtung (105), welche angeordnet ist, um im Betrieb
die gewählten Codeworte zu empfangen und in Antwort auf jedes
Codewort den entsprechenden Munddatenblock auszulesen und ein
Einfügen dieser Daten in die Daten, welche der Leseeinrichtung
(103) bereitgestellt werden, zu bewirken.